(54) Checkpoint computer system

(57) A computer system which can achieve a rollback
operation when a fault occurs in the system without waiting
for side-tracking of pre-update data during updating of a
file. When a file write request has been made, "file writing
information" pertaining to the file write is saved in a
pending queue and only a primary file is immediately
updated. After a checkpoint has been acquired, the "file
writing information" saved in the pending queue is shifted
to a confirmed queue, and is then written to a back-up file.
When performing recovery, all pre-update data which
corresponds to the data which has been updated since the
last acquired checkpoint is read from the back-up file,
based on the "file writing information" saved in the pending
queue. The primary file is then restored to its state at the
checkpoint time by using the pre-update data which has been
read from the back-up file.

Description

BACKGROUND OF THE INVENTION 
Field of the Invention : 

The present invention relates to checkpointing computer
systems suitable for application to group computing
processes in network-connected computing environments, and
to a method for managing files in the computer system. In
particular, the present invention relates to a duplicated
checkpointing computer system suitable for application in a
fault tolerant computer system, and to a method for
improving the efficiency of file updates in the fault
tolerant computer system.

More particularly, the present invention relates to a 
network-connected computer system and file manage- 
ment method suitable for applications such as database 
processes and transaction processes which require 
high reliability of the network-connected computer sys- 
tem. 

Further more particularly, the present invention relates to
a duplicated checkpointing computer system suitable for
application to construction of a fault tolerant computer
system which is comprised of a working system and a standby
system in which checkpoints are acquired and saved in both
the working system and the standby system, and a file
management method that achieves high efficiency in updating
files of the duplicated checkpointing computer system.

Still further, the present invention relates to multiple 
checkpointing computer systems in a network-connect- 
ed computing environment and a file management 
method that achieves high efficiency in updating files of 
multiple systems by eliminating unnecessary reading of 
pre-update data from files if a failure is detected during
execution of a process. 

DISCUSSION OF BACKGROUND 

Fault tolerant computer systems, often referred to as
checkpointing computer systems, utilize checkpoints which
are periodically acquired in order to recover from a fault
or failure which may be detected in operation between any
two checkpoints. In a checkpointing computer system, states
such as an address space and context information for each
operation, and files of each process, are periodically
stored (this operation is referred to as "checkpointing of
uniting process" or simply as checkpointing) for recovering
from the failure.

When a failure occurs, the state of the last acquired
checkpoint is restored and execution of the process is
restarted from that checkpoint. In conventional
checkpointing computer systems, there have been problems and
difficulties, in particular concerning external input/output
processing. When a process restarts execution from the last
acquired checkpoint due to detection of a fault, states such
as the address space of the process and context information
in the processor are easily restored. However, restoration
of the states of external input devices was not so simple.

For instance, it was impossible to cancel a writing
operation to the files in the system. Consequently, in the
prior art, when a file writing operation is performed, the
data held in the file up to that writing is read and saved
in advance of the writing operation. After completion of the
saving of the prior data, the writing operation writes new
data to the file.

Fig. 15 explains the writing operation to a file in a
conventional checkpointing computer system. In this writing
operation, the data that is to be overwritten is first read
and saved as rollback information. Then, the writing
operation is performed, writing the new data to the file.

In this example, a process file having 4 bytes of data
"ABCD" is acquired at a checkpoint of time t1. At time t2, a
command is received to write "X" to the 1st byte position in
the file (1). Before writing "X" to the 1st byte position of
the file, the original data "B" in the 1st byte position in
the file is read out and saved as rollback information
(this saved information is referred to as an "undo log")
(2). After saving the rollback information, the data "X" is
written into the 1st byte position of the file (3). When a
fault occurs at time t3, the process is rolled back to the
state of the last acquired checkpoint t1. Although the file
has already been updated by "X" in the 1st byte position at
time t2, the file state at the checkpoint t1 is restored by
using the undo log (4). The undo log is destroyed at the
time of the next checkpoint acquisition.
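
The conventional sequence (1) to (4) above can be sketched
in a few lines of Python. This is only an illustration under
stated assumptions: the class name UndoLoggedFile and its
methods are hypothetical, the file is modelled as an
in-memory byte array, and writes are assumed to stay within
the existing file length.

```python
class UndoLoggedFile:
    """Conventional scheme: old bytes are saved before every write."""

    def __init__(self, contents: bytes):
        self.data = bytearray(contents)
        self.undo_log = []  # (offset, old_bytes) pairs, oldest first

    def checkpoint(self) -> None:
        # The undo log is destroyed when the next checkpoint is acquired.
        self.undo_log.clear()

    def write(self, offset: int, new: bytes) -> None:
        # Step (2): read and save the pre-update data BEFORE writing.
        old = bytes(self.data[offset:offset + len(new)])
        self.undo_log.append((offset, old))
        # Step (3): only now is the file itself updated.
        self.data[offset:offset + len(new)] = new

    def rollback(self) -> None:
        # Step (4): undo the writes in reverse order.
        while self.undo_log:
            offset, old = self.undo_log.pop()
            self.data[offset:offset + len(old)] = old
```

The extra read inside write() is the cost that the
conventional method pays on every update, whether or not a
fault ever occurs.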

This method is also applicable to duplicated computer
systems comprised of two computers, one as the working
system (a primary computer) and the other as the standby
system (a back-up computer). When a fault occurs in the
primary computer, the back-up computer takes over the
process by utilizing the rollback file information which was
acquired at the last checkpoint.

As described above, the checkpointing computer system (no
matter whether it be duplicated or not) can improve the
reliability of a computer system by utilizing periodically
acquired process states and files. However, as explained
above, when performing updates (for instance, writing of
files), the conventional system and method have reduced
efficiency because data must be read and logged before the
file is updated.

SUMMARY OF THE INVENTION 

Accordingly, it is an object of the present invention to
provide a checkpoint based computer system and method for
managing files for achieving a higher efficiency of updating
files while maintaining a fault tolerant computer system.

It is another object of the present invention to provide a
checkpoint computer system for uniting processes and a file
management method for achieving a higher updating efficiency
of files while maintaining a fault tolerant computer system.

It is a further object of the present invention to provide a
duplicate computer system and file management method for
achieving a higher updating efficiency of files while
maintaining fault tolerance in the duplicate computer
system.

It is a still further object of the present invention to
provide a duplicate checkpointing computer system and method
in which, when a process is aborted, pre-update data is read
from a standby system file and used to restore a working
system file, allowing re-execution of the aborted process
from the last acquired checkpoint, thereby restoring normal
processing for completion of the aborted process without
delay, and improving file updating efficiency.

It is a still further object of the present invention to 
provide a computer system and a file management 
method which makes unnecessary the reading of pre- 
update data from files when performing routine file up- 
dates, and thus makes possible a significant improve- 
ment in file updating efficiency. 

These and other objects are achieved according to 
the present invention by providing a computer system 
having a failure recovery function that periodically ac- 
quires checkpoints and restarts a process from a last 
checkpoint when a fault occurs. 

Also, instead of restoring the working system file, it 
is also effective to re-execute the process from a check- 
point by using the standby system file, in which the 
whole update content indicated in the update informa- 
tion saved before the last checkpoint, is reflected. That 
is to say, continuation of processing is also guaranteed 
in such cases as re-starting when the working system
file is not available due to a fault in the working system 
computer, for example. Therefore, availability of the sys- 
tem can be improved. Also, if the standby system file is 
updated on a third computer, it is possible to further im- 
prove system availability. 

BRIEF DESCRIPTION OF THE DRAWINGS 

A more complete appreciation of the present inven- 
tion and many of the attendant advantages thereof will 
be readily obtained as the same becomes better under- 
stood by reference to the following detailed description 
when considered in connection with the accompanying 
drawings, wherein:

Fig. 1 is a block diagram for explaining the basic theory of
the present invention.

Fig. 2 is a block diagram showing the system composition of
the computer system in a preferable first embodiment
according to the present invention.

Fig. 3 is a schematic diagram illustrating the computer
system of the first embodiment according to the present
invention.

Fig. 4 is a schematic diagram illustrating how a file is
updated in the first embodiment according to the present
invention.

Fig. 5 is a schematic diagram illustrating how the primary
file is restored in the first embodiment of the present
invention when a fault occurs.

Fig. 6 is a flow-chart that illustrates the processing of a
file-writing operation in the first embodiment of the
present invention.

Fig. 7 is a flow-chart that illustrates processing of a
checkpoint acquisition operation in a file operating unit of
the first embodiment of the present invention.

Fig. 8 is a flow-chart that illustrates processing of the
file updating operation in a back-up unit of the first
embodiment of the present invention.

Fig. 9 is a flow-chart that illustrates processing of a
restarting operation from the last acquired checkpoint in
the primary computer when a fault, such as an abort, has
occurred in the process of the first embodiment of the
present invention.

Fig. 10 is a flow-chart illustrating the process of
restoring address space and processor context in the primary
computer of the first embodiment of the present invention.

Fig. 11 is a flow-chart illustrating the process of
restoration of the primary file in the first embodiment of
the present invention.

Fig. 12 is a drawing that illustrates how the back-up file
takes over processing when a fault has occurred in the first
embodiment according to the present invention.

Fig. 13 is a block diagram illustrating the composition of
the computer system of a second embodiment according to the
present invention.

Fig. 14 is a schematic diagram illustrating composition of
the computer system of the second embodiment according to
the present invention.

Fig. 15 is a schematic diagram that explains a data writing
operation into a file of a conventional checkpointing
computer system.

Referring now to the drawings, wherein like reference
numerals designate identical or corresponding parts
throughout the several views, and more particularly to
Figure 1 thereof, there is illustrated a conceptual block
diagram for explaining the basic theory of the present
invention.

As shown in Figure 1, the computer system according to the
present invention takes as a prerequisite a working system
10 (a multiplexed system, for example) and a standby system
20. The following is an explanation of their respective
operations.
During normal processing: 



(1) In the working system 10, an application program 11
issues a "write" system call.

(2) A jacket routine 12 hooks the "write" system call and
issues the "write" system call to an operating system of the
working system. By this system call, a file on the DISK 14
in the working system 10 is updated through the OS BUFFER
CACHE 13. At the same time, the jacket routine transmits
that "write" request to the standby system 20. However, it
is not necessary to transmit the "write" request to the
standby system 20 immediately. It may be transmitted at any
time up to the next checkpoint acquisition. Also, the
"write" request received by the DEMON process 21 in the
standby system 20 is not immediately executed, but is stored
temporarily in a pending queue 211.

(3) When a designated checkpointing process is instructed,
the working system 10 must complete transmission of all
queued "write" requests to the standby system 20.

(4) At the same time, in the standby system 20, any "write"
requests stored in the pending queue 211 are moved to a
confirmed queue 212.

(5) The "write" requests moved to the confirmed queue 212
are sequentially processed by the operating system of the
standby system 20. Namely, the operating system updates a
file on the DISK 23 in the standby system 20 through the OS
BUFFER CACHE 22. That is to say, in a file updating
operation which is generated during normal processing, there
is no waiting for completion of a process which reads out
and sidetracks pre-update data.
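
Steps (1) to (5) of normal processing can be sketched as
follows. This is a minimal illustration, not the described
implementation: the class and method names are hypothetical,
both disks are modelled as in-memory byte arrays, and
transmission to the standby system is a direct method call
that is deferred until the checkpoint, which the text above
permits.

```python
from collections import deque


class StandbySystem:
    """Hypothetical stand-in for standby system 20."""

    def __init__(self, contents: bytes):
        self.disk = bytearray(contents)  # standby file on DISK 23
        self.pending = deque()           # pending queue 211
        self.confirmed = deque()         # confirmed queue 212

    def receive(self, offset: int, data: bytes) -> None:
        # Step (2): a received "write" request is queued, not executed.
        self.pending.append((offset, data))

    def on_checkpoint(self) -> None:
        # Step (4): everything pending becomes confirmed.
        self.confirmed.extend(self.pending)
        self.pending.clear()

    def apply_confirmed(self) -> None:
        # Step (5): confirmed requests are applied in arrival order.
        while self.confirmed:
            offset, data = self.confirmed.popleft()
            self.disk[offset:offset + len(data)] = data


class WorkingSystem:
    """Hypothetical stand-in for working system 10 and jacket routine 12."""

    def __init__(self, contents: bytes, standby: StandbySystem):
        self.disk = bytearray(contents)  # working file on DISK 14
        self.standby = standby
        self.unsent = []                 # requests not yet transmitted

    def write(self, offset: int, data: bytes) -> None:
        # Steps (1)-(2): update the working file at once, with no read
        # of the old bytes, and remember the request for the standby.
        self.disk[offset:offset + len(data)] = data
        self.unsent.append((offset, data))

    def checkpoint(self) -> None:
        # Step (3): all queued requests must reach the standby first.
        for offset, data in self.unsent:
            self.standby.receive(offset, data)
        self.unsent.clear()
        self.standby.on_checkpoint()
```

Note that write() never reads the bytes it overwrites; the
standby file keeps the pre-update contents until the
confirmed requests are applied.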

During roll-back processing:

(3)' At times such as when a fault has occurred, the
roll-back process is instructed in both the working system
10 and the standby system 20.

(4)' At this time, any "write" requests remaining in the
working system 10 are all transmitted to the standby system
20. Then, all "write" requests stored in the confirmed queue
212 in the standby system 20 are executed. That is, all
confirmed "write" requests are written to a file in the
standby system 20. Any "write" requests stored in the
pending queue 211 of the standby system, on the other hand,
are those which have been issued since the last checkpoint.
Therefore, conversely, the pre-update data is read from a
standby system file 23 with reference to the stored "write"
requests. The working system file 14 is rolled back by using
the pre-update data read from the standby system file 23. By
this means, both the working system file 14 and the standby
system file 23 are placed into the state at the time of the
last checkpoint acquisition.

(5)' Then, the standby system 20 cancels all the remaining
"write" requests in the pending queue 211. By this means, it
becomes possible to re-start the process from the time of
the checkpoint.

The following are descriptions of some preferred 
embodiments of the present invention. 

First preferred embodiment:

Fig. 2 illustrates a computer system of the first embodiment
according to the present invention. As shown in Fig. 2, the
computer system is duplicated as a primary computer 30 and a
back-up computer 40. These primary and back-up computers are
coupled through a network 50. The primary and back-up
computers 30 and 40 are each provided with both the
above-mentioned working system 10 and standby system 20.
When the working system 10 is operating in either one of
them, the standby system 20 is operating in the other.

Here, the case is described of the working system 10 being
on the primary computer 30 side and the standby system 20
being on the back-up computer 40 side. A process 35 is
executed on the primary computer 30 and it updates
duplicated files, namely a primary file 39 and a back-up
file 41. The primary file 39 is provided in the primary
computer 30, and the back-up file 41 is provided in the
back-up computer 40. They are updated through a file system
36 in the primary computer 30 and a file system 48 in the
back-up computer 40.

The file system 36 in the primary computer 30 contains a
primary file operating unit 38 and a primary file
restoration unit 37. The file system 48 in the back-up
computer 40 contains a back-up file operating unit 43, a
pending queue 431, a confirmed queue 432, a back-up file
updating unit 44, and a primary file restoration information
reading unit 42.

When the process 35 updates the duplicated files, it
performs the updating operation through the primary file
operating unit 38 and back-up file operating unit 43. If the
process 35 performs a "write" to the duplicated files, the
primary file 39 is immediately updated. However, the back-up
file 41 is not updated at that point in time, and the "file
writing information", or write request, passes through the
back-up file operating unit 43 and is saved in the pending
queue 431 in the back-up computer 40.

Also, when the process 35 acquires a checkpoint, a
checkpointing control unit 31 issues a checkpoint
acquisition instruction to a checkpoint information saving
unit 32 and a primary file operating unit 38. When the
checkpoint information saving unit 32 receives the
checkpoint acquisition instruction, it saves checkpoint
information (e.g., address memory space and processor
context) on the primary computer 30 and on the back-up
computer 40. That is, the checkpoint information is saved in
a checkpoint information unit 34 on the primary computer 30
and also saved in a checkpoint information unit 45 on the
back-up computer 40.

At the same time, when the primary file operating unit 38
receives the checkpoint acquisition instruction, it causes
the "file writing information" saved in the pending queue
431 to be moved to the confirmed queue 432 through the
back-up file operating unit 43. The "file writing
information" which is moved to the confirmed queue 432 is
used for updating the back-up file 41 by the back-up file
updating unit 44 after checkpoint acquisition, and is
discarded after the back-up file 41 has been updated. By
this means, the same "write" operations as those which have
been performed on the primary file 39 up to the checkpoint
are also performed on the back-up file 41.

In the case of the process 35 causing the occurrence of a
fault, such as an abort, the process 35 is re-executed on
the primary computer 30 from the last acquired checkpoint,
the address space and processor context having been restored
by a checkpoint information restoration unit 33 on the
primary computer 30.

Concerning the files, the update information for the back-up
file 41 from the last checkpoint onward is still only saved
as "file writing information" in the pending queue 431.
Since the back-up file has not actually been updated,
restoration is not required. However, restoration is
required for the primary file 39 because updating has
already been performed from the last checkpoint onward.
Therefore, it is restored by reading the pre-update data
from the back-up file 41, and writing the pre-update data to
the primary file 39 based on the "file writing information"
saved in the pending queue 431. After this operation, the
"file writing information" saved in the pending queue 431 is
discarded. In the case of any "file writing information"
being saved in the confirmed queue 432, the above-mentioned
restoration process starts after completion of writing the
confirmed queue "file writing information" to the back-up
file 41.

On the other hand, in the case of the primary computer 30,
or an operating system which controls the primary computer
30, causing the generation of a fault such as "system down",
the process 35 is re-executed on the back-up computer 40
from the last checkpoint acquisition, and the address space
and the processor context are restored to the process 47 by
a checkpoint information restoration unit 46 on the back-up
computer 40.

Concerning the files, the update for the back-up file 41
from the checkpoint onward is still only saved in the
pending queue 431, and since the back-up file has not
actually been updated, restoration is not required.

It is possible to optimize the transmission of this "file
writing information" from the primary computer 30 to the
back-up computer 40. In the case of the primary computer 30
not having gone down when a fault has occurred, the primary
file 39 is restored, and processing is restarted from the
checkpoint by using the primary file 39. On the other hand,
in the case of the primary computer 30 going down when a
fault has occurred, processing is restarted from the
checkpoint by using the back-up file 41.

For that reason, there is no requirement for the immediate
transmission of the "file writing information" from the
primary file operating unit 38 to the back-up file operating
unit 43. That is to say, this "file writing information" may
be transmitted at any time up to the next checkpoint
acquisition. Therefore, in consideration of transmission
efficiency, it is possible to store the "file writing
information" temporarily in the primary file operating unit
38 and to batch-transmit it to the back-up file operating
unit 43, using a specified volume of stored information, a
specified elapsed time, or a checkpoint acquisition request
as triggers.
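
The three triggers named above (stored volume, elapsed time,
checkpoint request) can be sketched as a small batching
buffer. The class name WriteBatcher, its parameters and the
default thresholds are assumptions made for illustration
only; the send callable stands in for whatever transport
carries the batch to the back-up file operating unit 43. In
this sketch the elapsed-time trigger is evaluated only when
a new write arrives, not by a background timer.

```python
import time


class WriteBatcher:
    """Buffers "file writing information" and sends it in batches."""

    def __init__(self, send, max_bytes=4096, max_age_s=0.5,
                 clock=time.monotonic):
        self.send = send            # callable taking a list of requests
        self.max_bytes = max_bytes  # volume trigger (assumed value)
        self.max_age_s = max_age_s  # elapsed-time trigger (assumed value)
        self.clock = clock
        self.buffer = []
        self.buffered_bytes = 0
        self.oldest = None          # arrival time of the oldest entry

    def add(self, offset: int, data: bytes) -> None:
        if not self.buffer:
            self.oldest = self.clock()
        self.buffer.append((offset, data))
        self.buffered_bytes += len(data)
        too_big = self.buffered_bytes >= self.max_bytes
        too_old = self.clock() - self.oldest >= self.max_age_s
        if too_big or too_old:
            self.flush()

    def on_checkpoint(self) -> None:
        # Everything must reach the back-up side before the checkpoint.
        self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.send(list(self.buffer))
            self.buffer.clear()
            self.buffered_bytes = 0
            self.oldest = None
```

Deferring transmission is safe only because the back-up side
does not need the requests until the checkpoint; the
checkpoint trigger is therefore mandatory, while the other
two merely bound memory use and latency.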

Fig. 3 shows an application of the computer system of the
first embodiment. The computers are duplicated as a primary
computer 30 and a back-up computer 40. A disk device 60a is
connected to the primary computer 30, and a disk device 60b
is connected to the back-up computer 40. The process 35 in
Fig. 2 is being executed on the primary computer 30. Also,
the file which process 35 is accessing is duplicated as a
primary file 39 and a back-up file 41. These are
respectively provided in the disk device 60a and the disk
device 60b.

Also, for the checkpoints, the checkpoint information is
stored on both the primary computer 30 as primary checkpoint
information 34 and the back-up computer 40 as back-up
checkpoint information 45. In this embodiment, these
checkpoints are held in the respective disk devices 60a and
60b. It is also possible to store them in the memories of
the computer system.

If a fault such as "system down" is detected in the primary
computer 30 or in the operating system which controls the
primary computer 30, a process 47 is re-executed on the
back-up computer 40 by using the back-up checkpoint
information 45 and the back-up file 41 in the disk 60b.

Also, it is possible to have multiple copies of the primary
file 39 or the back-up file 41 and to produce a triplicate
or greater file system. If, for instance, it is a triplicate
file system, combinations such as the following can be
considered.

(1) 2 primary files and 1 back-up file
(2) 1 primary file and 2 back-up files

Fig. 4 shows a file updating operation in this embodiment.
In this example, a process 35 which is executing on the
primary computer 30 has written "X" to the 1st byte position
at a time t1 in a duplicated file, i.e., a primary file 39
on the primary computer 30 and a back-up file 41 on the
back-up computer 40, the duplicated files each containing 4
bytes of data "ABCD" prior to writing "X" (1). Although the
primary file 39 is immediately updated by this means, the
back-up file 41 is not immediately updated, and only the
"file writing information" pertaining to the writing of "X"
is saved.

After this, the execution of the "file writing information"
pertaining to the writing of "X" is confirmed at a
checkpoint acquisition time t2 (2). Then, updating of the
back-up file 41 is executed from the time t2 onward based on
the confirmed "file writing information".

Fig. 5 depicts how a primary file is restored when a fault
occurs in this embodiment. In this example, the process 35
which is executing on the primary computer 30 has written
"X" to the 1st byte position at the time t1 in a duplicated
file, i.e., the primary file 39 on the primary computer 30
and the back-up file 41 on the back-up computer 40, which
each have 4 bytes of data "ABCD" (1). Although the primary
file 39 is immediately updated by this means, the back-up
file 41 is not immediately updated, and only the "file
writing information" is saved.

After this, a fault occurs at the time t2 (2). Since the
primary file 39 was updated at the time t1, there is a
requirement to restore it. However, since the back-up file
41 has not yet been updated, there is no requirement to
restore that. Accordingly, the data needed for restoring the
primary file 39 is determined by the "file writing
information" saved at the time t1. Therefore, the primary
file 39 is restored by reading the data at the position
indicated by the pending "file writing information" from the
back-up file 41 and writing it to the primary file 39.

Then, by using a checkpoint acquired on the primary computer
30, the process 35 is re-executed on the primary computer
30. This re-executed process 35 uses the restored primary
file 39.

Fig. 6 is a flow-chart showing the processing flow when a
file write instruction is received by the primary file
operating unit. At first, the "file writing information" is
saved and it is linked to the pending queue 431 (Step A1).
Then, the primary file 39 is updated in accordance with the
"file writing information" (Step A2). At this time, the file
writing operation is completed and completion notification
is sent to the requesting side (Step A3).

Fig. 7 is a flow-chart showing the processing flow when a
checkpoint acquisition instruction has been received by the
file operating unit. As shown in Step B1, the saved "file
writing information" is moved from the pending queue 431 to
the confirmed queue 432.

Fig. 8 is a flow-chart showing the processing flow of the
back-up file updating unit. At first, it checks whether or
not "file writing information" is linked to the confirmed
queue 432 (Step C1). If it is not so linked (N branch of
Step C1), the back-up file updating unit 44 continues
checking. If it is so linked (Y branch of Step C1), it
updates the back-up file 41 based on the "file writing
information" linked in the confirmed queue 432 (Step C2).
Then, it removes the executed "file writing information"
from the confirmed queue 432 (Step C3).

Fig. 9 is a flow-chart showing the processing flow in the
case of a fault such as an abort having occurred in the
process 35, and the process 35 being re-executed on the
primary computer 30 from the last checkpoint acquisition.
When a fault occurs in the process 35, the checkpoint
information restoration unit 33 in the primary computer 30
is instructed to "restore address space and processor
context" (Step D1). Next, the primary file restoration unit
37 is instructed to "restore primary file" (Step D2).

Fig. 10 is a flow-chart showing the processing flow in the
case of the checkpoint information restoration unit 33 in
the primary computer 30 having been instructed to "restore
address space and processor context". At first, the address
space of the process 35 is restored (Step E1). Then, the
state of the processor context at the time of checkpoint
acquisition of the process 35 is restored (Step E2).

Fig. 11 is a flow-chart showing the processing flow in the
case of the primary file restoration unit 37 having been
instructed to "restore primary file". At first, it checks
whether or not "file writing information" is linked to the
pending queue 431 (Step F1). If "file writing information"
is so linked (Y branch of Step F1), the data at the location
indicated by that "file writing information", which location
has already been updated in the primary file 39, is read
from the back-up file 41. The primary file 39 is then
restored by writing the data that has been read from the
back-up file 41 to the primary file 39 (Step F2). Then, the
"file writing information" which was used in the restoration
is removed from the pending queue 431 (Step F3). This
process is repeated until the primary file 39 has been
restored for all "file writing information" linked to the
pending queue 431.
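
The loop of Steps F1 to F3 can be sketched as below. The
function name restore_primary is hypothetical, files are
modelled as in-memory byte arrays, and each queue entry is
assumed to be an (offset, data) pair. Following the earlier
description, any entries still in the confirmed queue are
written to the back-up file before restoration starts.

```python
from collections import deque


def restore_primary(primary: bytearray, backup: bytearray,
                    pending: deque, confirmed: deque) -> None:
    """Roll the primary file back to the last checkpoint."""
    # Confirmed writes belong to the checkpointed state, so the
    # back-up file must absorb them before it is used as a source.
    while confirmed:
        offset, data = confirmed.popleft()
        backup[offset:offset + len(data)] = data

    # Step F1: is any "file writing information" still pending?
    while pending:
        offset, data = pending[0]
        # Step F2: the back-up file still holds the pre-update bytes
        # at this location; copy them over the updated primary bytes.
        primary[offset:offset + len(data)] = backup[offset:offset + len(data)]
        # Step F3: drop the entry that has just been undone.
        pending.popleft()
```

With the "ABCD" example of Fig. 5, a primary file holding
"AXCD" and one pending entry for offset 1 is restored to
"ABCD" without any undo log having been written during
normal processing.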

In the case of a fault, such as "system down", occurring in
the primary computer 30 or the operating system which
controls the primary computer 30, the process 35 is
re-executed on the back-up computer 40 from the last
acquired checkpoint. In this case, the process continues
with the back-up file 41.

Fig. 12 shows a state in which the process is continued with
the back-up file 41 after such a fault has occurred. In this
example, the process 35 which is executing on the primary
computer 30 has written "X" to the 1st byte position at a
time t1 in a duplicated file, i.e., the primary file 39 on
the primary computer 30 and the back-up file 41 on the
back-up computer 40, which each store 4 bytes of data "ABCD"
(1). In this example, the primary file 39 is immediately
updated, but the back-up file 41 is not immediately updated,
and only the "file writing information" is saved.

After this, a fault occurs in the primary computer 30 at a
time t2 (2). In this case, the process 47 is re-executed on
the back-up computer 40 by using the checkpoint which has
been acquired on the back-up computer 40. At this time, the
process 47 continues processing by using the back-up file
41. Although the primary file 39 was updated at the time t1,
the back-up file 41 has not yet been updated. Consequently,
during re-execution of the process 47 on the back-up
computer 40, the back-up file 41 can be used as it stands.

In the case of a back-up file having been truncated through
the occurrence of a fault, the initial state such as shown
in Fig. 1 can be reproduced afterwards by producing a new
back-up file. Recovery processing is possible, even if the
fault occurs again.

In the case of processing being continued with a back-up
file due to the occurrence of a fault, and the processing
being re-executed from a checkpoint, the initial state as
shown in Fig. 1 can afterwards once again be reproduced by
taking the back-up file as the primary file and producing a
new back-up file. The recovery process is possible, even if
the fault reoccurs. Three methods of producing a new back-up
file are now presented.

(1 ) In the case of the back-up file being reconnected 
by saving the primary file update informatbn and 
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data after truncating the back-up file, that primary 
file update information and data after truncation are 
reflected in the back-up file. 

(2) The primary file is copied to the back-up file. 
However, in the case of the primary file being up- 
dated during copying also, the primary file update 
information and data are reflected in the back-up 
file at the same time as copying starts. 

Moreover, the following method which is a com- 
binatbn of these two methods is also effective. 

(3) Taking the re-connection of the back-up file 
which was truncated (or the primary file before the 
occurrence of the fault) as a prerequisite, the pri- 
mary file update information and data after the 
back-up file has been truncated is saved until a 
specified time has elapsed, as if method (1) were 
adopted. After the specified time has elapsed, 
method (1) is abandoned, and the saving of the primary
file update information and data after the truncation
of the back-up file stops, so that method (2)
can be adopted. Also, in the case of reconnecting 
files other than the truncated back-up file, the sav- 
ing of the primary file update information and data 
after the truncation of the back-up file is stopped, 
and method (2) is adopted. 
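
The choice between the three methods can be illustrated with a minimal Python sketch. Everything named here (`ResyncPlanner`, `record_update`, `reconnect`, the file identifiers and the explicit `now` timestamps) is hypothetical and does not appear in the specification. The sketch also assumes in-memory byte arrays in place of files and ignores primary file updates that arrive while the full copy of method (2) is in progress.

```python
class ResyncPlanner:
    """Hypothetical helper that picks how to rebuild a truncated back-up file."""

    def __init__(self, truncated_file_id, truncated_at, limit_seconds):
        self.truncated_file_id = truncated_file_id
        self.deadline = truncated_at + limit_seconds
        self.log = []         # saved primary file update information and data
        self.logging = True   # method (1) is in force while this is True

    def record_update(self, now, offset, data):
        """Called for every primary file update made after the truncation."""
        if self.logging and now > self.deadline:
            # The specified time has elapsed: abandon method (1), stop saving.
            self.logging = False
            self.log.clear()
        if self.logging:
            self.log.append((offset, bytes(data)))

    def reconnect(self, now, file_id, primary, backup):
        """Bring `backup` in line with `primary`; return the method used."""
        same_file = file_id == self.truncated_file_id
        if self.logging and same_file and now <= self.deadline:
            # Method (1): replay only the updates saved since truncation.
            for offset, data in self.log:
                backup[offset:offset + len(data)] = data
            method = 1
        else:
            # Method (2): copy the whole primary file to the back-up file.
            backup[:] = primary
            method = 2
        self.logging = False
        self.log.clear()
        return method
```

Reconnecting the truncated back-up file within the time limit replays the saved updates; reconnecting a different file, or reconnecting after the limit, falls back to the full copy.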

Second embodiment: 

The following is a description of a second preferred 
embodiment of the present invention. In the first embod- 
iment, a duplicated computer system was described. 
However, the present invention is effective even when applied
to file systems on computers which are not duplicated.
Thus, in the second embodiment, the present invention
as applied to a file system on a single computer is
explained in relation to Fig. 13.

In Fig. 13, there is only one computer 30. Process 
35 is executed on the computer 30, and updates a file 
which is duplicated as a primary file 39 and a back-up 
file 41. The primary file 39 and the back-up file 41 are
both provided in the computer 30, and they are updated 
through a file system 36. 

The file system 36 contains a primary file operating 
unit 38, a primary file restoration unit 37, a back-up file 
operating unit 43, a pending queue 431, a confirmed
queue 432, a back-up file updating unit 44 and a primary 
file restoration information reading unit 42. 

When process 35 updates the duplicated files, it 
does so through the primary file operating unit 38 and 
the back-up file operating unit 43. When process 35
performs a "write" operation to the duplicated files, the
primary file 39 is updated as it stands. However, the back-up
file 41 is not updated, and the "file writing information"
is saved in the pending queue 431 through the back-up 
file operating unit 43. 

Also, when process 35 acquires checkpoints, a 
checkpoint control unit 31 issues checkpoint acquisition 
instructions to a checkpoint information saving unit 32
and a primary file operating unit 38. When the checkpoint
information saving unit 32 receives a checkpoint
acquisition instruction, it acquires the checkpoint
information 34 consisting of address space and processor
context in the computer 30.

At the same time, when the primary file operating 
unit 38 receives the checkpoint acquisition instruction, 
any "file writing information" which has been saved in 
the pending queue 431 is shifted to the confirmed queue 
432 through the back-up file operating unit 43. The "file
writing information" which has been shifted to the confirmed
queue 432 is used by a back-up file updating unit
44 for updating the back-up file 41 after checkpoint
acquisition, and is dumped after the updating of the back-up
file 41. By doing this, a "write" operation is performed
for the back-up file 41 in the same way as that performed
previously for the primary file 39.
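
The write path and checkpoint handling described above can be sketched in Python as follows. This is only an illustrative model under stated assumptions: the class `DuplicatedFile` and its method names are hypothetical, the two files are modelled as in-memory byte arrays, and each item of "file writing information" is assumed to be an (offset, data) pair.

```python
from collections import deque


class DuplicatedFile:
    """Hypothetical in-memory model of the primary/back-up file pair."""

    def __init__(self, size):
        self.primary = bytearray(size)   # stands in for primary file 39
        self.backup = bytearray(size)    # stands in for back-up file 41
        self.pending = deque()           # pending queue 431
        self.confirmed = deque()         # confirmed queue 432

    def write(self, offset, data):
        # The primary file is updated immediately; the back-up file is not.
        self.primary[offset:offset + len(data)] = data
        # Only the "file writing information" is saved for later use.
        self.pending.append((offset, bytes(data)))

    def checkpoint(self):
        # On checkpoint acquisition, pending entries are shifted to the
        # confirmed queue.
        self.confirmed.extend(self.pending)
        self.pending.clear()

    def flush_confirmed(self):
        # After the checkpoint, replay each confirmed write on the back-up
        # file and discard the entry once it has been applied.
        while self.confirmed:
            offset, data = self.confirmed.popleft()
            self.backup[offset:offset + len(data)] = data
```

In this model the caller of `write` never waits for the back-up file; the back-up file only catches up when `flush_confirmed` runs after a checkpoint.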

In the case of a fault such as an "abort" occurring, 
and the process 35 being re-executed from the last ac- 
quired checkpoint, the address space and the processor 
context are restored by the checkpoint information
restoration unit 33 in the computer 30.

Concerning the files, for the back-up file 41, the
updating from the checkpoint onward is still only saved as
"file writing information" in the pending queue 431, and
restoration is not required since it has not actually been
updated. However, restoration is required for the primary
file 39 since updating has already been performed
from the checkpoint onward. Therefore, the data
representative of data in the primary file 39 before updating
is read from the back-up file 41 based on the "file writing
information" saved in the pending queue 431. This
pre-update data restores the primary file 39 by being written
to the primary file 39. After writing to the primary file 39,
the "file writing information" which is saved in the pending
queue 431 is dumped. When "file writing information"
is still saved in the confirmed queue 432, the
above-mentioned restoration process starts after updating the
back-up file 41 in accordance with the "file writing
information" in the confirmed queue 432.
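
The restoration order described above can be sketched as a short Python function. The function name `roll_back` and the representation of the queues as lists of (offset, data) pairs are assumptions made for illustration; the files are again modelled as byte arrays, and all writes are assumed to fall within the existing file length.

```python
def roll_back(primary, backup, pending, confirmed):
    """Restore `primary` to its state at the last checkpoint (hypothetical sketch)."""
    # Step 1: apply any writes still waiting in the confirmed queue, so the
    # back-up file holds the exact checkpoint-time contents.
    for offset, data in confirmed:
        backup[offset:offset + len(data)] = data
    confirmed.clear()

    # Step 2: for every write made since the checkpoint, read the pre-update
    # data from the back-up file and write it over the primary file.
    for offset, data in pending:
        end = offset + len(data)
        primary[offset:end] = backup[offset:end]

    # Step 3: the pending "file writing information" is no longer needed.
    pending.clear()
```

Updating the back-up file first matters when a write made after the checkpoint overlaps a region whose confirmed write has not yet reached the back-up file; otherwise stale data would be copied to the primary file.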

Fig. 14 shows the schematic construction of a computer
system to which the second embodiment is applied.
The unduplicated computer 30 is coupled to a disk
device 60a and a disk device 60b. Process 35 is exe- 
cuted in the computer 30. The process 35 accesses the 
duplicated files of a primary file 39 and a back-up file 41 
which are respectively provided in the disk device 60a 
and the disk device 60b. 

By applying the present invention in this way, exe- 
cution continues while periodically saving the states of 
the process address space and the processor context 
as checkpoint information. When a fault occurs, the 
process is re-executed from the last acquired check- 
point. In a system in which such countermeasures
against faults are taken, faults are more easily recov- 
ered from, and file updating performance is substantially 
improved. 

The file management method stated in the above
embodiments can be installed and distributed on floppy
disks, optical disks, semiconductor memories, etc., as
programs to be executed on computers.

As described above in detail, according to the 
present invention, when a process requires updating of 
a file, updating information ("file writing information") is 
saved and, at the same time, only the primary file is im- 
mediately updated. After a checkpoint has been ac- 
quired, the updating content shown by that saved up- 
date information is caused to be reflected in the back- 
up file. Then, when the process aborts, all the pre-up- 
date data which corresponds to the data which has been 
updated since the last acquired checkpoint are read 
from the back-up file, based on the saved updating
information. The primary file is restored to its state at the
time of the checkpoint by using the pre-update data read
from the back-up file, and re-execution of the process
starts. It is also possible to start the re-execution of a
process by using the back-up file. 

In the computer system according to the present in- 
vention, it is possible to achieve recovery of a file at the 
time of a fault without delaying normal processing for 
completion of a process such as the reading out and 
sidetracking of pre-update data when updating a file. 
Thus, it is possible to very significantly improve file
updating performance without loss of reliability of the com-
puter system. 

Obviously, numerous modifications and variations
of the present invention are possible in light of the above 
teachings. It is therefore to be understood that within the 
scope of the appended claims, the invention may be 
practiced otherwise than as specifically described here- 
in. 



Claims

1. A computer system incorporating checkpoint security,
in which process checkpoints are periodically
collected and saved in order to enable the process
to be re-started in the event of an interruption,
characterised in that the process working files are
duplicated to provide a corresponding standby file
system, and in that, when an update is applied to the
working files during normal operation, the update
information is not applied immediately to the
corresponding standby files, but is retained separately
for application to the standby files at the time of
collection of the next checkpoint.

2. A computer system according to claim 1 characterised
in that the working files and the standby files
are maintained on separate disk systems.

3. A computer system according to claim 1 or claim 2
characterised in that the working file system and the
standby file system are maintained on separate
computer systems.

4. A duplicated computer system according to claim 3
and comprising:

a working computer system including a working
file system;

a standby computer system including a standby
file system;

a checkpoint information unit for performing
checkpoint acquisition by collecting checkpoint
information needed for re-starting an application
process on said working computer system
at a checkpoint and saving said checkpoint
information in both said working computer system
and said standby computer system; and

an update process executed on said working
computer system for updating both the working
file system and the standby file system;

wherein:

said update process, upon receipt of a file write
command initiated by said application process,
updates said working file system in accordance
with said file write command, saves file update
information pertaining to said file write command
on said standby computer system, and
notifies said application process upon completion
of saving said file update information and
updating said working file system, and

said update process, upon receipt of a checkpoint
command, updates said standby file system
in accordance with said file update information
saved on said standby computer.



5. The duplicated computer system according to claim
4, wherein said working computer system includes
an update information buffer for buffering said file
update information on said working computer system
and batch-transmitting said file update information
to said standby computer system until a next
checkpoint acquisition occurs.

6. The duplicated computer system according to claim
4 or claim 5, further comprising a restoration unit
that, after interruption of said application process,
reads pre-update data corresponding to said file
update information saved on said standby file system,
restores said working file system to a state at a last
checkpoint by writing said pre-update data to said
working file system, and re-executes said application
process from the last checkpoint.

7. The duplicated computer system according to any
of claims 4 to 6, further comprising a backup file
operating unit that, after interruption of said application
process, or a fault in the working file system, the
working computer system, or its operating system,
deletes any update information which has been
saved from a last checkpoint onward, updates said
standby file system according to file update information
saved prior to the last checkpoint, and
re-executes the application process on said standby
computer system from said last checkpoint.

8. The duplicated computer system according to any
of claims 4 to 7 wherein, when a fault occurs in said
standby file system, said standby computer system,
or the operating system which controls said standby
computer system, said checkpoint information unit
stops saving said checkpoint information to said
standby computer system and said update process
stops saving said file update information to said
standby computer system.

9. The duplicated computer system according to any
of claims 4 to 8, further comprising:

a third computer system; and
a backup file operating unit that, upon a truncation
of said standby file system, copies said
standby file system onto said third computer
system.

10. The duplicated computer system according to any
of claims 4 to 9, further comprising a primary file
restoration unit that provides a current copy of said
standby file system on said working system when
re-execution of said application process is to be
performed from a last checkpoint by copying said
standby file system to said working computer system.

11. A duplicated computer system, comprising:

a working system including a working system
computer;

a standby system including a standby system
computer;

a checkpoint control unit for collecting checkpoints
including information for re-starting processes
executed on said working system computer,
and saving said checkpoints on the computers
of both said working system and said
standby system; and

a file management system which maintains
files to be updated by said processes executed
on said working system computer in duplicate
on both said working system computer and said
standby system computer, updates a working
file on said working system computer according
to file update commands from said processes
executed on said working computer system,
saves file update information according to said
file update commands on the said standby system
computer, notifies each of said processes
originating a file update command upon completion
of the update to the working file and saving
of file update information to the standby system
computer, and updates a standby file according
to the file update information saved to
the standby system computer after a checkpoint
is collected by said checkpoint control
unit.

12. A computer system comprising:

a working file system and a standby file system;
a checkpoint information unit for performing
checkpoint acquisition by collecting checkpoint
information needed for re-starting an application
process on said computer system at a
checkpoint and saving said checkpoint information
in both said working file system and said
standby file system; and
an update process executed on said computer
system for updating both the working file system
and the standby file system; and
a restoration unit for restoring said working file
system upon an interruption of said application
process, and restarting said application process
from a last acquired checkpoint;

wherein:

said update process, upon receipt of a file write
command initiated by said application process,
updates said working file system in accordance
with said file write command, saves file update
information pertaining to said file write command
on said standby file system, and notifies
said application process upon completion of
saving said file update information and updating
said working file system;
said update process, upon receipt of a checkpoint
command, updates said standby file system
in accordance with said file update information
saved on said standby computer.

13. A method of managing files on a duplicated computer
system including a working computer system, a
working file system, a standby computer system,
and a standby file system, comprising the steps of:

collecting checkpoints each including information
needed for re-starting an application process
at a corresponding checkpoint on said
working computer system;
processing a file write command issued from
said application process executing on said
working computer system, including the
sub-steps of:

updating said working file system according to
said file write command, and

saving file update information corresponding to
said file write command to said standby computer
system;

updating said standby file system according to
saved file update information upon collection of
a checkpoint;
restarting an application process on said working
computer system without loss of information,
upon occurrence of an interruption of said
application process, including the sub-step of:
rolling back a state of said working computer
system to a last checkpoint, comprising the
sub-steps of:

reading pre-update data from said standby file
system in accordance with said file update
information saved to said standby computer system,
said pre-update data corresponding to data
written to said working file system since said
last checkpoint;

writing said pre-update data to said working file
system, thereby restoring said working file system
to a state corresponding to said last checkpoint, and

removing said file update information corresponding
to file write commands since said last
checkpoint from said standby computer system.